Spoken Term Discovery for Language Documentation using Translations

Authors

  • Antonios Anastasopoulos
  • Sameer Bansal
  • David Chiang
  • Sharon Goldwater
  • Adam Lopez
Abstract

Vast amounts of speech data collected for language documentation and research remain untranscribed and unsearchable, but often a small amount of speech may have text translations available. We present a method for partially labeling additional speech with translations in this scenario. We modify an unsupervised speech-to-translation alignment model and obtain prototype speech segments that match the translation words, which are in turn used to discover terms in the unlabelled data. We evaluate our method on a Spanish-English speech translation corpus and on two corpora of endangered languages, Arapaho and Ainu, demonstrating its appropriateness and applicability in an actual very-low-resource scenario.

Similar resources

A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments

Most speech and language technologies are trained with massive amounts of speech and text information. However, most of the world languages do not have such resources or stable orthography. Systems constructed under these almost zero resource conditions are not only promising for speech technology but also for computational language documentation. The goal of computational language documentatio...

An Attentional Model for Speech Translation Without Transcription

For many low-resource languages, spoken language resources are more likely to be annotated with translations than transcriptions. This bilingual speech data can be used for word-spotting, spoken document retrieval, and even for documentation of endangered languages. We experiment with the neural, attentional model applied to this data. On phone-to-word alignment and translation reranking tasks, ...

Multilingual Spoken Language Understanding using graphs and multiple translations

In this paper, we present an approach to multilingual Spoken Language Understanding based on a process of generalization of multiple translations, followed by a specific methodology to perform a semantic parsing of these combined translations. A statistical semantic model, which is learned from a segmented and labeled corpus, is used to represent the semantics of the task in a language. Our goa...

Adult’s Learning Strategies for Receptive Skill Self-managing or Teacher-managing

Receptive language skill refers to answering appropriately to another person's spoken language. A lot of teachers try to develop receptive language skills in their language learners. When receptive language skills are not appropriately acquired, learners may miss significant learning opportunities resulting in delays in the development and acquisition of spoken language. The goals of this paper...

A case study on using speech-to-translation alignments for language documentation

For many low-resource or endangered languages, spoken language resources are more likely to be annotated with translations than with transcriptions. Recent work exploits such annotations to produce speech-to-translation alignments, without access to any text transcriptions. We investigate whether providing such information can aid in producing better (mismatched) crowdsourced transcriptions, wh...

Publication date: 2017